Step 1

PM2.5 California Data for 2004

data04 <- read.csv("pm25ca2004.csv") # Load in data
dim(data04) # Check dimensions
## [1] 19233    20
names(data04) # Check variable names
##  [1] "Date"                           "Source"                        
##  [3] "Site.ID"                        "POC"                           
##  [5] "Daily.Mean.PM2.5.Concentration" "UNITS"                         
##  [7] "DAILY_AQI_VALUE"                "Site.Name"                     
##  [9] "DAILY_OBS_COUNT"                "PERCENT_COMPLETE"              
## [11] "AQS_PARAMETER_CODE"             "AQS_PARAMETER_DESC"            
## [13] "CBSA_CODE"                      "CBSA_NAME"                     
## [15] "STATE_CODE"                     "STATE"                         
## [17] "COUNTY_CODE"                    "COUNTY"                        
## [19] "SITE_LATITUDE"                  "SITE_LONGITUDE"
str(data04) # Check variable types
## 'data.frame':    19233 obs. of  20 variables:
##  $ Date                          : chr  "01/01/2004" "01/02/2004" "01/03/2004" "01/04/2004" ...
##  $ Source                        : chr  "AQS" "AQS" "AQS" "AQS" ...
##  $ Site.ID                       : int  60010007 60010007 60010007 60010007 60010007 60010007 60010007 60010007 60010007 60010007 ...
##  $ POC                           : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ Daily.Mean.PM2.5.Concentration: num  8.9 12.2 16.5 18.1 11.5 32.5 14 29.9 21 16.9 ...
##  $ UNITS                         : chr  "ug/m3 LC" "ug/m3 LC" "ug/m3 LC" "ug/m3 LC" ...
##  $ DAILY_AQI_VALUE               : int  37 51 60 64 48 94 55 88 70 61 ...
##  $ Site.Name                     : chr  "Livermore" "Livermore" "Livermore" "Livermore" ...
##  $ DAILY_OBS_COUNT               : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ PERCENT_COMPLETE              : num  100 100 100 100 100 100 100 100 100 100 ...
##  $ AQS_PARAMETER_CODE            : int  88101 88502 88502 88101 88502 88502 88101 88502 88502 88502 ...
##  $ AQS_PARAMETER_DESC            : chr  "PM2.5 - Local Conditions" "Acceptable PM2.5 AQI & Speciation Mass" "Acceptable PM2.5 AQI & Speciation Mass" "PM2.5 - Local Conditions" ...
##  $ CBSA_CODE                     : int  41860 41860 41860 41860 41860 41860 41860 41860 41860 41860 ...
##  $ CBSA_NAME                     : chr  "San Francisco-Oakland-Hayward, CA" "San Francisco-Oakland-Hayward, CA" "San Francisco-Oakland-Hayward, CA" "San Francisco-Oakland-Hayward, CA" ...
##  $ STATE_CODE                    : int  6 6 6 6 6 6 6 6 6 6 ...
##  $ STATE                         : chr  "California" "California" "California" "California" ...
##  $ COUNTY_CODE                   : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ COUNTY                        : chr  "Alameda" "Alameda" "Alameda" "Alameda" ...
##  $ SITE_LATITUDE                 : num  37.7 37.7 37.7 37.7 37.7 ...
##  $ SITE_LONGITUDE                : num  -122 -122 -122 -122 -122 ...

PM2.5 California Data for 2019

data19 <- read.csv("pm25ca2019.csv")
dim(data19)
## [1] 53086    20
head(data19)
tail(data19)
names(data19)
##  [1] "Date"                           "Source"                        
##  [3] "Site.ID"                        "POC"                           
##  [5] "Daily.Mean.PM2.5.Concentration" "UNITS"                         
##  [7] "DAILY_AQI_VALUE"                "Site.Name"                     
##  [9] "DAILY_OBS_COUNT"                "PERCENT_COMPLETE"              
## [11] "AQS_PARAMETER_CODE"             "AQS_PARAMETER_DESC"            
## [13] "CBSA_CODE"                      "CBSA_NAME"                     
## [15] "STATE_CODE"                     "STATE"                         
## [17] "COUNTY_CODE"                    "COUNTY"                        
## [19] "SITE_LATITUDE"                  "SITE_LONGITUDE"
str(data19) 
## 'data.frame':    53086 obs. of  20 variables:
##  $ Date                          : chr  "01/01/2019" "01/02/2019" "01/03/2019" "01/04/2019" ...
##  $ Source                        : chr  "AQS" "AQS" "AQS" "AQS" ...
##  $ Site.ID                       : int  60010007 60010007 60010007 60010007 60010007 60010007 60010007 60010007 60010007 60010007 ...
##  $ POC                           : int  3 3 3 3 3 3 3 3 3 3 ...
##  $ Daily.Mean.PM2.5.Concentration: num  5.7 11.9 20.1 28.8 11.2 2.7 2.8 7 3.1 7.1 ...
##  $ UNITS                         : chr  "ug/m3 LC" "ug/m3 LC" "ug/m3 LC" "ug/m3 LC" ...
##  $ DAILY_AQI_VALUE               : int  24 50 68 86 47 11 12 29 13 30 ...
##  $ Site.Name                     : chr  "Livermore" "Livermore" "Livermore" "Livermore" ...
##  $ DAILY_OBS_COUNT               : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ PERCENT_COMPLETE              : num  100 100 100 100 100 100 100 100 100 100 ...
##  $ AQS_PARAMETER_CODE            : int  88101 88101 88101 88101 88101 88101 88101 88101 88101 88101 ...
##  $ AQS_PARAMETER_DESC            : chr  "PM2.5 - Local Conditions" "PM2.5 - Local Conditions" "PM2.5 - Local Conditions" "PM2.5 - Local Conditions" ...
##  $ CBSA_CODE                     : int  41860 41860 41860 41860 41860 41860 41860 41860 41860 41860 ...
##  $ CBSA_NAME                     : chr  "San Francisco-Oakland-Hayward, CA" "San Francisco-Oakland-Hayward, CA" "San Francisco-Oakland-Hayward, CA" "San Francisco-Oakland-Hayward, CA" ...
##  $ STATE_CODE                    : int  6 6 6 6 6 6 6 6 6 6 ...
##  $ STATE                         : chr  "California" "California" "California" "California" ...
##  $ COUNTY_CODE                   : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ COUNTY                        : chr  "Alameda" "Alameda" "Alameda" "Alameda" ...
##  $ SITE_LATITUDE                 : num  37.7 37.7 37.7 37.7 37.7 ...
##  $ SITE_LONGITUDE                : num  -122 -122 -122 -122 -122 ...

Checking The Key Variable from Each Data Set

summary(data04) # Summarize variables to gain insight
##      Date              Source             Site.ID              POC        
##  Length:19233       Length:19233       Min.   :60010007   Min.   : 1.000  
##  Class :character   Class :character   1st Qu.:60370002   1st Qu.: 1.000  
##  Mode  :character   Mode  :character   Median :60658001   Median : 1.000  
##                                        Mean   :60588026   Mean   : 1.816  
##                                        3rd Qu.:60750006   3rd Qu.: 2.000  
##                                        Max.   :61131003   Max.   :12.000  
##                                                                           
##  Daily.Mean.PM2.5.Concentration    UNITS           DAILY_AQI_VALUE 
##  Min.   : -0.10                 Length:19233       Min.   :  0.00  
##  1st Qu.:  6.00                 Class :character   1st Qu.: 25.00  
##  Median : 10.10                 Mode  :character   Median : 42.00  
##  Mean   : 13.13                                    Mean   : 46.33  
##  3rd Qu.: 16.30                                    3rd Qu.: 60.00  
##  Max.   :251.00                                    Max.   :301.00  
##                                                                    
##   Site.Name         DAILY_OBS_COUNT PERCENT_COMPLETE AQS_PARAMETER_CODE
##  Length:19233       Min.   :1       Min.   :100      Min.   :88101     
##  Class :character   1st Qu.:1       1st Qu.:100      1st Qu.:88101     
##  Mode  :character   Median :1       Median :100      Median :88101     
##                     Mean   :1       Mean   :100      Mean   :88267     
##                     3rd Qu.:1       3rd Qu.:100      3rd Qu.:88502     
##                     Max.   :1       Max.   :100      Max.   :88502     
##                                                                        
##  AQS_PARAMETER_DESC   CBSA_CODE      CBSA_NAME           STATE_CODE
##  Length:19233       Min.   :12540   Length:19233       Min.   :6   
##  Class :character   1st Qu.:31080   Class :character   1st Qu.:6   
##  Mode  :character   Median :40140   Mode  :character   Median :6   
##                     Mean   :35328                      Mean   :6   
##                     3rd Qu.:41860                      3rd Qu.:6   
##                     Max.   :49700                      Max.   :6   
##                     NA's   :1253                                   
##     STATE            COUNTY_CODE        COUNTY          SITE_LATITUDE  
##  Length:19233       Min.   :  1.00   Length:19233       Min.   :32.63  
##  Class :character   1st Qu.: 37.00   Class :character   1st Qu.:34.07  
##  Mode  :character   Median : 65.00   Mode  :character   Median :36.48  
##                     Mean   : 58.63                      Mean   :36.23  
##                     3rd Qu.: 75.00                      3rd Qu.:38.10  
##                     Max.   :113.00                      Max.   :41.71  
##                                                                        
##  SITE_LONGITUDE  
##  Min.   :-124.2  
##  1st Qu.:-121.6  
##  Median :-119.3  
##  Mean   :-119.7  
##  3rd Qu.:-117.9  
##  Max.   :-115.5  
## 
summary(data04$Daily.Mean.PM2.5.Concentration) # Summarize key variable
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   -0.10    6.00   10.10   13.13   16.30  251.00
sum(is.na(data04)) # Check if data has any missing values
## [1] 1253
sum(is.na(data04$Daily.Mean.PM2.5.Concentration)) # Check key variable too
## [1] 0
summary(data19)
##      Date              Source             Site.ID              POC        
##  Length:53086       Length:53086       Min.   :60010007   Min.   : 1.000  
##  Class :character   Class :character   1st Qu.:60310004   1st Qu.: 1.000  
##  Mode  :character   Mode  :character   Median :60612003   Median : 3.000  
##                                        Mean   :60565291   Mean   : 2.562  
##                                        3rd Qu.:60771002   3rd Qu.: 3.000  
##                                        Max.   :61131003   Max.   :21.000  
##                                                                           
##  Daily.Mean.PM2.5.Concentration    UNITS           DAILY_AQI_VALUE 
##  Min.   : -2.200                Length:53086       Min.   :  0.00  
##  1st Qu.:  4.000                Class :character   1st Qu.: 17.00  
##  Median :  6.500                Mode  :character   Median : 27.00  
##  Mean   :  7.733                                   Mean   : 30.56  
##  3rd Qu.:  9.900                                   3rd Qu.: 41.00  
##  Max.   :120.900                                   Max.   :185.00  
##                                                                    
##   Site.Name         DAILY_OBS_COUNT PERCENT_COMPLETE AQS_PARAMETER_CODE
##  Length:53086       Min.   :1       Min.   :100      Min.   :88101     
##  Class :character   1st Qu.:1       1st Qu.:100      1st Qu.:88101     
##  Mode  :character   Median :1       Median :100      Median :88101     
##                     Mean   :1       Mean   :100      Mean   :88214     
##                     3rd Qu.:1       3rd Qu.:100      3rd Qu.:88502     
##                     Max.   :1       Max.   :100      Max.   :88502     
##                                                                        
##  AQS_PARAMETER_DESC   CBSA_CODE      CBSA_NAME           STATE_CODE
##  Length:53086       Min.   :12540   Length:53086       Min.   :6   
##  Class :character   1st Qu.:31080   Class :character   1st Qu.:6   
##  Mode  :character   Median :40140   Mode  :character   Median :6   
##                     Mean   :35841                      Mean   :6   
##                     3rd Qu.:41860                      3rd Qu.:6   
##                     Max.   :49700                      Max.   :6   
##                     NA's   :4181                                   
##     STATE            COUNTY_CODE        COUNTY          SITE_LATITUDE  
##  Length:53086       Min.   :  1.00   Length:53086       Min.   :32.58  
##  Class :character   1st Qu.: 31.00   Class :character   1st Qu.:34.14  
##  Mode  :character   Median : 61.00   Mode  :character   Median :36.63  
##                     Mean   : 56.39                      Mean   :36.35  
##                     3rd Qu.: 77.00                      3rd Qu.:37.97  
##                     Max.   :113.00                      Max.   :41.76  
##                                                                        
##  SITE_LONGITUDE  
##  Min.   :-124.2  
##  1st Qu.:-121.6  
##  Median :-119.8  
##  Mean   :-119.8  
##  3rd Qu.:-118.1  
##  Max.   :-115.5  
## 
summary(data19$Daily.Mean.PM2.5.Concentration)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  -2.200   4.000   6.500   7.733   9.900 120.900
sum(is.na(data19))
## [1] 4181
sum(is.na(data19$Daily.Mean.PM2.5.Concentration))
## [1] 0

Based on the summary of our key variable, Daily.Mean.PM2.5.Concentration, the minimum values fall below zero, which is physically impossible for a particulate matter concentration. Because of this, we will run extra code that removes these non-positive records from both data sets. Furthermore, Daily.Mean.PM2.5.Concentration has no missing values, so no rows need to be dropped for missingness.

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.3     ✓ purrr   0.3.4
## ✓ tibble  3.1.0     ✓ stringr 1.4.0
## ✓ tidyr   1.1.3     ✓ forcats 0.5.1
## ✓ readr   1.4.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(data.table)
## 
## Attaching package: 'data.table'
## The following object is masked from 'package:purrr':
## 
##     transpose
## The following objects are masked from 'package:dplyr':
## 
##     between, first, last
data04 <- data04 %>%
    filter(Daily.Mean.PM2.5.Concentration > 0)
summary(data04$Daily.Mean.PM2.5.Concentration) # Recheck minimum value
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.10    6.00   10.10   13.14   16.30  251.00
setorder(data04, Daily.Mean.PM2.5.Concentration) # Order by PM conc.
head(data04) # Check headers after ordering
tail(data04) # Check footers after ordering
data19 <- data19 %>%
    filter(Daily.Mean.PM2.5.Concentration > 0)
summary(data19$Daily.Mean.PM2.5.Concentration)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.10    4.10    6.50    7.79   10.00  120.90
setorder(data19, Daily.Mean.PM2.5.Concentration)
head(data19)
tail(data19)

After filtering out concentration values at or below zero and ordering each data set, head() and tail() confirmed that the implausible values were removed and that the rows are sorted by concentration. The data can now be deemed ready for analysis; the next step combines the two frames.
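As a defensive check, a small helper can assert that no non-positive concentrations survive the filter. This is only a sketch with a hypothetical `checkPositive()` helper, run here on toy values rather than the real column:

```r
# Hypothetical helper: stop with an error if any value is <= 0
checkPositive <- function(x) {
  stopifnot(is.numeric(x), all(x > 0, na.rm = TRUE))
  invisible(TRUE)
}

# Toy values standing in for the filtered concentration column;
# in the real workflow this would be
# checkPositive(data04$Daily.Mean.PM2.5.Concentration)
checkPositive(c(0.1, 6.0, 10.1, 251.0))
```

Running this after each filter() call turns a silent assumption into a loud failure if a bad value ever slips through.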

Step 2

Combining the Two Years of Data

dataMerged <- merge(x = data04, y = data19, all = TRUE) # Merge the two sets
dataMerged$Date <-  as.Date(dataMerged$Date,"%m/%d/%Y") #Reformat as date
dataMerged$Year <- as.numeric(format(dataMerged$Date,'%Y')) # Extract year
summary(dataMerged$Year) # Confirm new column
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    2004    2004    2019    2015    2019    2019
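Because the 2004 and 2019 files share an identical set of columns, stacking with `dplyr::bind_rows()` is a lighter-weight alternative to `merge(..., all = TRUE)` for this step. A minimal sketch, using two toy one-row frames standing in for `data04` and `data19`:

```r
library(dplyr)

# Toy stand-ins for data04 and data19 (same column names, as in the
# real files); bind_rows() stacks them row-wise
d04 <- data.frame(Date = "01/01/2004", Daily.Mean.PM2.5.Concentration = 8.9)
d19 <- data.frame(Date = "01/01/2019", Daily.Mean.PM2.5.Concentration = 5.7)

combined <- bind_rows(d04, d19)
combined$Date <- as.Date(combined$Date, "%m/%d/%Y") # Reformat as Date
combined$Year <- as.numeric(format(combined$Date, "%Y")) # Extract year
combined$Year # 2004 2019
```

With identical columns the two approaches give the same rows; bind_rows() simply skips the key-matching work that merge() does.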

Changing Names of Variables

dataMerged <- 
  rename(dataMerged, 
         dailyConc = Daily.Mean.PM2.5.Concentration, # Shorter, consistent names
         lat = SITE_LATITUDE,
         long = SITE_LONGITUDE,
         siteName = Site.Name,
         siteID = Site.ID)

Step 3

Locations of Sites Using leaflet()

library(leaflet)

temp.pal <- colorFactor(c("#A3C41B", "#438A7D"), domain = dataMerged$Year) # Palette creation
  
leaflet(dataMerged) %>% # Chunk adapted from lecture and online examples of leaflet()
  addProviderTiles('CartoDB.Positron') %>%
  addCircles(
    lat = ~lat, lng = ~long, color = ~temp.pal(Year),
    opacity = 1, fillOpacity = 1, radius = 100) %>%
  addLegend('bottomleft', pal=temp.pal, values=dataMerged$Year,
            title='Year', opacity=1)

Based on the produced map, there are clearly more sites in 2019 than in 2004, which matches the earlier dimension checks of the two data sets. This may reflect the growing importance of pollutant monitoring, and the funding that followed, over the fifteen-year gap. Additionally, the number of sites appears to track population density, with heavily populated regions like the San Francisco Bay Area, Greater Los Angeles, and San Diego having the most sites. This makes sense in many regards: because pollution often follows people (transportation, polluting habits, etc.), monitoring the places likely to have the most environmental damage is sensible, and such reporting is most relevant where it reaches the most residents. Overall, urban areas are well represented in this combined data set.
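The claim that 2019 has more monitoring sites can also be checked directly by counting distinct site IDs per year. A sketch using a toy stand-in for `dataMerged` (two sites in 2004, three in 2019):

```r
library(dplyr)

# Toy stand-in for dataMerged; the real call would group the merged
# frame and count distinct siteID values per Year
toy <- data.frame(
  siteID = c(1, 2, 1, 2, 3),
  Year   = c(2004, 2004, 2019, 2019, 2019)
)

toy %>%
  group_by(Year) %>%
  summarise(nSites = n_distinct(siteID))
```

On the real data this gives one row per year with the number of unique sites, confirming (or refuting) what the map suggests visually.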

Step 4

Checking for Missing Values of PM2.5 in Combined Data Set

sum(is.na(dataMerged$dailyConc)) # Count if there are missing values
## [1] 0
mean(is.na(dataMerged$dailyConc)) # Proportion of missing values, if any
## [1] 0

There are no missing values of PM2.5 in the combined data set, consistent with the per-year checks during the data preparation stage (only non-positive concentrations were removed; no values were ever missing).

Checking for Implausible Values of PM2.5 in Combined Data Set

setorder(dataMerged, -dailyConc) # Arrange by decreasing order to view high anomalies
dataMerged %>%
  select(Date, dailyConc, DAILY_AQI_VALUE, siteName) # Show specific data wanted to view

Extremely small concentrations of PM2.5 are physically possible, so the low end of the distribution is plausible for any pollution data set. Attention was therefore focused on the highest concentrations to ensure they were not implausible or outside expected levels. Ordering the daily concentrations in decreasing order puts the largest recordings at the top: the two highest measurements (251.0 and 170.4 ug/m3 LC) were both recorded in Yosemite during the summer of 2004. The AQI was correspondingly high for those same measurements (301 and 221, respectively), which lends further support to the numbers being valid. A brief external search shows that wildfires are frequent in that area, and under the EPA's index an AQI of 301-500 ("hazardous") corresponds to 24-hour PM2.5 concentrations of roughly 250 ug/m3 and above, levels typically driven by events such as wildfire smoke. For these reasons, the two extreme values seem plausible rather than erroneous.
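The same "order and eyeball" check can be expressed as an explicit filter against a chosen cutoff. A sketch with a hypothetical 150 ug/m3 threshold and toy rows standing in for the merged data:

```r
library(dplyr)

# Toy rows standing in for dataMerged; 150 is an arbitrary cutoff
# chosen for illustration, not an EPA limit
toy <- data.frame(
  dailyConc = c(6.0, 251.0, 170.4),
  siteName  = c("Livermore", "Yosemite", "Yosemite")
)

flagged <- toy %>%
  filter(dailyConc > 150) # Keep only suspiciously high rows
nrow(flagged) # 2
```

On the real data this produces a short list of extreme rows to investigate individually, rather than scrolling through an ordered frame.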

Step 5

Exploratory Plots and Summary Statistics by State

library(ggplot2)
# Histogram
ggplot(dataMerged, aes(x = dailyConc)) +
  geom_histogram(bins = 150, fill = "gray", color = "navy") +
  facet_wrap(. ~ Year) +
  xlim(0,100)
## Warning: Removed 10 rows containing non-finite values (stat_bin).
## Warning: Removed 4 rows containing missing values (geom_bar).

# Boxplot
ggplot(dataMerged, aes(x = dailyConc)) +
  geom_boxplot(color = "navy") +
  facet_wrap(. ~ Year)

# Line plot
ggplot(dataMerged, aes(y = dailyConc, x = Date)) +
  geom_line(color = "navy") +
  facet_wrap(. ~ Year, scales = "free")

# Summarize
dataMerged %>%
  group_by(Year) %>%
  summarise(mean = mean(dailyConc), # Various summary stats
            median = median(dailyConc),
            sd = sd(dailyConc),
            min = min(dailyConc),
            max = max(dailyConc),
            IQR = IQR(dailyConc))

Because this merged data set holds only California data (as selected at download), no filtering or sorting by a geographic variable was needed at the state level, so only a few figures were produced at this spatial scale. Even so, there are clear takeaways from comparing California in 2004 and 2019: the 2019 distribution is shifted toward lower concentrations (mean 7.79 vs. 13.14 ug/m3, median 6.50 vs. 10.10) and is noticeably less spread out.

Exploratory Plots and Summary Statistics by County

# Plot to compare counties among each other and between 2004 and 2019
ggplot(dataMerged) +
  geom_point(mapping = aes(x = COUNTY, y = dailyConc, colour = factor(Year))) +
  scale_color_manual(values=c("#A3C41B", "#438A7D")) +
  labs(x = "County", y = "PM2.5 Concentration (ug/m^3 LC)") +
  theme(axis.text.x = element_text(angle = 90, vjust = .5, size = 5))

# Summary
dataMerged %>%
  group_by(COUNTY, Year) %>% # COUNTY before year to see same county data side-by-side
  summarise(mean = mean(dailyConc),
            median = median(dailyConc),
            sd = sd(dailyConc),
            min = min(dailyConc),
            max = max(dailyConc),
            IQR = IQR(dailyConc))
## `summarise()` has grouped output by 'COUNTY'. You can override using the `.groups` argument.

Based on the output when stratifying by county, it appears that the majority of counties experienced an overall decrease in PM2.5 concentration between 2004 and 2019. A few appear to be anomalies, like San Joaquin and Tehama counties; these may reflect a genuine increase in pollutants, or simply the absence of reporting in 2004. Additionally, this stratification allows county-to-county comparison: the figure shows that Mariposa County experienced dramatically high PM2.5 concentrations in 2004 relative to other counties, while Lake County has had the lowest PM2.5 concentrations among all CA counties.

The statistical summary output confirms that the majority of counties reduced their average PM2.5 concentration, and shows that many also reduced their respective maximum measurements, IQRs, medians, and even standard deviations. Overall, the reduction in pollutant concentration is evident.
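The county-level claim can be made quantitative by reshaping the per-county means with `tidyr::pivot_wider()` and computing each county's change between the two years. A sketch on a toy frame (Alameda decreasing, San Joaquin increasing, echoing the anomaly noted above):

```r
library(dplyr)
library(tidyr)

# Toy stand-in for dataMerged with one county decreasing and one increasing
toy <- data.frame(
  COUNTY    = c("Alameda", "Alameda", "San Joaquin", "San Joaquin"),
  Year      = c(2004, 2019, 2004, 2019),
  dailyConc = c(13, 7, 8, 9)
)

change <- toy %>%
  group_by(COUNTY, Year) %>%
  summarise(mean = mean(dailyConc), .groups = "drop") %>%
  pivot_wider(names_from = Year, values_from = mean,
              names_prefix = "y") %>%     # Columns y2004, y2019
  mutate(delta = y2019 - y2004)           # Negative = improvement
change$delta # -6  1
```

Sorting the real result by `delta` would surface the San Joaquin/Tehama-style anomalies immediately, and counties missing a 2004 column (NA) are exactly those that were not reporting then.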

Exploratory Plots and Summary Statistics by site in Los Angeles

# Plot to compare LA sites among each other and between 2004 and 2019
dataLA <- 
  dataMerged %>%
  filter(COUNTY == "Los Angeles") # Take out other counties' data

ggplot(dataLA) +
  geom_point(mapping = aes(x = siteName, y = dailyConc, colour = factor(Year))) +
  scale_color_manual(values=c("#437B8A", "#F8C228")) +
  labs(x = "LA Site Name", y = "PM2.5 Concentration (ug/m^3 LC)") +
  theme(axis.text.x = element_text(angle = 90, vjust = .5, size = 5))

# Summary
dataLA %>%
  group_by(siteName, Year) %>%
  summarise(mean = mean(dailyConc),
            median = median(dailyConc),
            sd = sd(dailyConc),
            min = min(dailyConc),
            max = max(dailyConc),
            IQR = IQR(dailyConc))
## `summarise()` has grouped output by 'siteName'. You can override using the `.groups` argument.

Based on the LA-specific figure, it is clear that some sites did not exist in the year 2004, with only eight sites having operated in both of the investigated years. All eight experienced decreases in PM2.5 concentration, and the remaining sites all had relatively low measurements in 2019 regardless.

Much like the stratified county data, the summary statistics quantify how much each site changed over the fifteen-year span. Excluding the maximum value at the Reseda site, every statistic decreased from 2004 to 2019 for sites with data in both years. Overall, it appears Los Angeles County in particular has also experienced a decrease in particulate matter concentrations.
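The "sites present in both years" set can be extracted programmatically rather than read off the figure, by counting distinct years per site. A sketch on a toy frame in which only site "A" reports in both years:

```r
library(dplyr)

# Toy stand-in for dataLA; only site "A" appears in 2004 and 2019
toy <- data.frame(
  siteName = c("A", "B", "A", "C"),
  Year     = c(2004, 2004, 2019, 2019)
)

bothYears <- toy %>%
  group_by(siteName) %>%
  summarise(nYears = n_distinct(Year), .groups = "drop") %>%
  filter(nYears == 2) %>% # Keep sites seen in both years
  pull(siteName)
bothYears # "A"
```

On the real `dataLA` this should return the eight site names identified visually, and the same vector can then feed a `filter(siteName %in% bothYears)` for a like-for-like comparison.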